单细胞免疫组库数据分析||Seurat整合单细胞转录组与VDJ数据
分享是一种态度
作者 | 周运来
男,
一个长大了才会遇到的帅哥,
稳健,潇洒,大方,靠谱。
一段生信缘,一棵技能树,
一枚大型测序工厂的螺丝钉,
一个随机森林中提灯觅食的津门旅客。
在做10X单细胞免疫组库分析的是往往是做一部分BCR、TCR做一部分5‘转录组,那么怎样才能把两者结合到一起呢?
今天我们尝试用我们的趁手工具做一下整合分析。
首先是下载数据,我们从10X官方的dataset中下载数据:https://support.10xgenomics.com/single-cell-vdj/datasets/3.1.0/vdj_v1_hs_pbmc3
在下载页面有关于这个样本的基本介绍,如这个数据集根据单细胞V(D)J试剂试剂盒使用指南和细胞表面蛋白特征条形码技术(CG000186),从标记的细胞中扩增出cDNA,生成5’基因表达、细胞表面蛋白、TCR富集和Ig富集文库。这里是不是应该把本文的题目改成:Seurat整合单细胞转录组与VDJ数据与膜蛋白数据啊。
读入数据:
1rm(list =ls())
2library(Seurat)
3data <- Read10X(data.dir = "F:\\VDJ\\filtered_feature_bc_matrix")
410X data contains more than one type and is being returned as a list containing matrices of each type.
上来就报错:
1seurat_object = CreateSeuratObject(counts = data)
2Error in CreateAssayObject(counts = counts, min.cells = min.cells, min.features = min.features) :
3 No cell names (colnames) names present in the input matrix
4>
原来我们的feature中不仅有基因表达还有膜蛋白的信息,分开来读好了:
1seurat_object = CreateSeuratObject(counts = data$`Gene Expression`)
2seurat_object[['Protein']] = CreateAssayObject(counts = data$`Antibody Capture`)
我们熟悉的seurat对象又回来了:
1seurat_object
2
3An object of class Seurat
433555 features across 7231 samples within 2 assays
5Active assay: RNA (33538 features)
6 1 other assay present: Protein
注意这里是两个数据, RNA ,Protein:
当我们需要分析哪一部分的时候,用DefaultAssay(seurat_object)
来指定,这里我们只分析RNA的数据啊,茂密的森林伸出两条路,我们也看一眼膜蛋白的数据吧:
1 seurat_object@assays$Protein
2Assay data with 17 features for 7231 cells
3First 10 features:
4 CD3-UCHT1-TotalC, CD19-HIB19-TotalC, CD45RA-HI100-TotalC, CD4-RPA-T4-TotalC, CD8a-RPA-T8-TotalC, CD14-M5E2-TotalC, CD16-3G8-TotalC, CD56-QA17A16-TotalC, CD25-BC96-TotalC,
5CD45RO-UCHL1-TotalC
接着我们读入BCR的数据:
1bcr <- read.csv("F:\\VDJ\\filtered_contig_annotations.csv")
2bcr[1:4,]
3 barcode is_cell contig_id high_confidence length chain v_gene d_gene j_gene c_gene full_length productive cdr3
41 AAACCTGTCACTGGGC-1 True AAACCTGTCACTGGGC-1_contig_1 True 556 IGK IGKV1D-33 None IGKJ4 IGKC True True CQQYDNLPLTF
52 AAACCTGTCACTGGGC-1 True AAACCTGTCACTGGGC-1_contig_2 True 516 IGH IGHV4-59 None IGHJ5 IGHM True True CARGGNSGLDPW
63 AAACCTGTCACTGGGC-1 True AAACCTGTCACTGGGC-1_contig_3 True 569 IGK IGKV2D-30 None IGKJ1 IGKC True False CWGLLLHARYTLAWTF
74 AAACCTGTCAGGTAAA-1 True AAACCTGTCAGGTAAA-1_contig_1 True 548 IGK IGKV1-12 None IGKJ4 IGKC True True CQQANSFPLTF
8 cdr3_nt reads umis raw_clonotype_id raw_consensus_id
91 TGTCAACAGTATGATAATCTCCCGCTCACTTTC 7410 61 clonotype7 clonotype7_consensus_1
102 TGTGCGAGAGGCGGGAACAGTGGCTTAGACCCCTGG 1458 18 clonotype7 clonotype7_consensus_2
113 TGTTGGGGTTTATTACTGCATGCAAGGTACACACTGGCCTGGACGTTC 2045 17 clonotype7 None
124 TGTCAACAGGCTAACAGTTTCCCGCTCACTTTC 140 1 clonotype2 clonotype2_consensus_2
为了保持barcode一致,处理一下barcode的命名:
1bcr$barcode <- gsub("-1", "", bcr$barcode)
2bcr <- bcr[!duplicated(bcr$barcode), ]
3names(bcr)[names(bcr) == "raw_clonotype_id"] <- "clonotype_id"
4table(bcr$v_gene)
5
6 IGHV1-18 IGHV1-2 IGHV1-24 IGHV1-3 IGHV1-45 IGHV1-46 IGHV1-58 IGHV1-69 IGHV1-69D IGHV1-8 IGHV1/OR15-1 IGHV1/OR21-1 IGHV2-26 IGHV2-5 IGHV2-70
7 30 48 5 14 1 18 2 1 26 8 1 1 9 28 8
8 IGHV3-11 IGHV3-13 IGHV3-15 IGHV3-16 IGHV3-21 IGHV3-23 IGHV3-30 IGHV3-33 IGHV3-43 IGHV3-48 IGHV3-49 IGHV3-53 IGHV3-64 IGHV3-64D IGHV3-66
9 20 7 29 1 43 95 40 50 10 25 10 13 5 4 2
10 IGHV3-7 IGHV3-72 IGHV3-73 IGHV3-74 IGHV4-30-2 IGHV4-31 IGHV4-34 IGHV4-39 IGHV4-4 IGHV4-59 IGHV4-61 IGHV5-10-1 IGHV5-51 IGHV6-1 IGHV7-4-1
11 15 6 7 25 9 4 30 55 22 76 9 5 21 4 8
12 IGKV1-12 IGKV1-16 IGKV1-17 IGKV1-27 IGKV1-33 IGKV1-37 IGKV1-39 IGKV1-5 IGKV1-6 IGKV1-8 IGKV1-9 IGKV1D-13 IGKV1D-33 IGKV1D-37 IGKV1D-39
13 29 10 8 9 9 2 6 42 3 16 11 1 15 6 65
14 IGKV1D-43 IGKV1D-8 IGKV2-24 IGKV2-28 IGKV2-30 IGKV2D-26 IGKV2D-28 IGKV2D-29 IGKV2D-30 IGKV2D-40 IGKV3-11 IGKV3-15 IGKV3-20 IGKV3-7 IGKV3D-11
15 5 4 6 26 14 5 1 9 1 2 39 58 89 2 1
16 IGKV3D-15 IGKV3D-20 IGKV4-1 IGKV5-2 IGKV6-21 IGKV6D-21 IGLV1-40 IGLV1-44 IGLV1-47 IGLV1-51 IGLV10-54 IGLV2-11 IGLV2-14 IGLV2-18 IGLV2-23
17 9 2 60 8 2 1 31 37 26 24 3 24 60 1 20
18 IGLV2-8 IGLV3-1 IGLV3-10 IGLV3-12 IGLV3-16 IGLV3-19 IGLV3-21 IGLV3-25 IGLV3-27 IGLV3-9 IGLV4-3 IGLV4-60 IGLV4-69 IGLV5-37 IGLV5-39
19 20 35 15 1 2 15 27 16 4 4 2 5 15 7 1
20 IGLV5-45 IGLV5-48 IGLV5-52 IGLV6-57 IGLV7-43 IGLV7-46 IGLV8-61 IGLV9-49 None
21 9 1 2 14 5 10 15 9 173
22>
读入克隆型信息:
1clono <- read.csv("F:\\VDJ\\vdj_v1_hs_pbmc3_b_clonotypes.csv")
2head(clono)
3
4 clonotype_id frequency proportion cdr3s_aa cdr3s_nt
51 clonotype1 7 0.009067358 IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
62 clonotype2 3 0.003886010 IGK:CQQANSFPLTF;IGK:CQQYDNWPPYTF IGK:TGTCAACAGGCTAACAGTTTCCCGCTCACTTTC;IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
73 clonotype5 2 0.002590674 IGL:CQARDSSTVVF IGL:TGTCAGGCGCGGGACAGCAGCACTGTGGTATTC
84 clonotype6 2 0.002590674 IGH:CARPGTTGTTGLKNW;IGK:CQQYNNWPLTF IGH:TGTGCGAGACCCGGTACAACTGGAACGACGGGTTTAAAAAACTGG;IGK:TGTCAGCAGTATAATAACTGGCCTCTCACCTTC
95 clonotype4 2 0.002590674 IGH:CAKSFFTGTGQFHYW;IGK:CQQYSTYSQTF IGH:TGTGCGAAATCATTTTTTACCGGGACAGGACAGTTTCACTATTGG;IGK:TGCCAACAGTATAGTACTTATTCTCAAACGTTC
106 clonotype3 2 0.002590674 IGL:CQSYDSSLSGVF IGL:TGCCAGTCCTATGACAGCAGCCTGAGTGGGGTGTTC
整合到seurat之中:
1# Slap the AA sequences onto our original table by clonotype_id.
2bcr <- merge(bcr, clono)
3
4# Reorder so barcodes are first column and set them as rownames.
5rownames(bcr) <- bcr[,2]
6#tcr[,1] <- NULL
7
8# Add to the Seurat object's metadata.
9clono_seurat <- AddMetaData(object=seurat_object, metadata=bcr)
10
11head(clono_seurat@meta.data)
12 orig.ident nCount_RNA nFeature_RNA nCount_Protein nFeature_Protein clonotype_id barcode is_cell contig_id high_confidence length chain v_gene d_gene j_gene c_gene full_length
13AAACCTGAGATCTGAA SeuratProject 4386 1206 5092 17 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
14AAACCTGAGGAACTGC SeuratProject 6258 1723 6480 17 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
15AAACCTGAGGAGTCTG SeuratProject 4243 1404 3830 17 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
16AAACCTGAGGCTCTTA SeuratProject 5205 1622 3589 17 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
17AAACCTGAGTACGTTC SeuratProject 7098 2123 5250 17 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
18AAACCTGGTATCACCA SeuratProject 1095 604 3573 17 <NA> <NA> <NA> <NA> <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
19 productive cdr3 cdr3_nt reads umis raw_consensus_id frequency proportion cdr3s_aa cdr3s_nt
20AAACCTGAGATCTGAA <NA> <NA> <NA> NA NA <NA> NA NA <NA> <NA>
21AAACCTGAGGAACTGC <NA> <NA> <NA> NA NA <NA> NA NA <NA> <NA>
22AAACCTGAGGAGTCTG <NA> <NA> <NA> NA NA <NA> NA NA <NA> <NA>
23AAACCTGAGGCTCTTA <NA> <NA> <NA> NA NA <NA> NA NA <NA> <NA>
24AAACCTGAGTACGTTC <NA> <NA> <NA> NA NA <NA> NA NA <NA> <NA>
25AAACCTGGTATCACCA <NA> <NA> <NA> NA NA <NA> NA NA <NA> <NA>
为什么这么多NA?因为转录组中的细胞很多,而免疫细胞(BCR)只有一部分。
1 table(rownames(clono_seurat@meta.data) %in% rownames(bcr))
2
3FALSE TRUE
4 6469 762
5> clono_seurat<- subset(clono_seurat,cells = rownames(bcr))
6> head(clono_seurat@meta.data)
7 orig.ident nCount_RNA nFeature_RNA nCount_Protein nFeature_Protein clonotype_id barcode is_cell contig_id high_confidence length chain v_gene d_gene
8CTTTGCGCAAGGACTG SeuratProject 1266 642 696 17 clonotype1 CTTTGCGCAAGGACTG True CTTTGCGCAAGGACTG-1_contig_1 True 479 IGK IGKV3-15 None
9CAGCTAACACCGATAT SeuratProject 935 484 545 17 clonotype1 CAGCTAACACCGATAT True CAGCTAACACCGATAT-1_contig_1 True 549 IGK IGKV3-15 None
10TATTACCGTCGCTTCT SeuratProject 1352 594 422 17 clonotype1 TATTACCGTCGCTTCT True TATTACCGTCGCTTCT-1_contig_1 True 565 IGK IGKV3-15 None
11CTGAAGTGTGAAGGCT SeuratProject 1259 626 450 17 clonotype1 CTGAAGTGTGAAGGCT True CTGAAGTGTGAAGGCT-1_contig_1 True 554 IGK IGKV3-15 None
12CCATGTCAGACCGGAT SeuratProject 1088 555 741 17 clonotype1 CCATGTCAGACCGGAT True CCATGTCAGACCGGAT-1_contig_1 True 556 IGK IGKV3-15 None
13CGTTCTGAGCTCAACT SeuratProject 808 461 4896 17 clonotype1 CGTTCTGAGCTCAACT True CGTTCTGAGCTCAACT-1_contig_1 True 565 IGK IGKV3-15 None
14 j_gene c_gene full_length productive cdr3 cdr3_nt reads umis raw_consensus_id frequency proportion cdr3s_aa
15CTTTGCGCAAGGACTG IGKJ2 IGKC True True CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 197 2 clonotype1_consensus_1 7 0.009067358 IGK:CQQYDNWPPYTF
16CAGCTAACACCGATAT IGKJ2 IGKC True True CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 274 2 clonotype1_consensus_1 7 0.009067358 IGK:CQQYDNWPPYTF
17TATTACCGTCGCTTCT IGKJ2 IGKC True True CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 299 2 clonotype1_consensus_1 7 0.009067358 IGK:CQQYDNWPPYTF
18CTGAAGTGTGAAGGCT IGKJ2 IGKC True True CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 250 3 clonotype1_consensus_1 7 0.009067358 IGK:CQQYDNWPPYTF
19CCATGTCAGACCGGAT IGKJ2 IGKC True True CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 324 2 clonotype1_consensus_1 7 0.009067358 IGK:CQQYDNWPPYTF
20CGTTCTGAGCTCAACT IGKJ2 IGKC True True CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 187 3 clonotype1_consensus_1 7 0.009067358 IGK:CQQYDNWPPYTF
21 cdr3s_nt
22CTTTGCGCAAGGACTG IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
23CAGCTAACACCGATAT IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
24TATTACCGTCGCTTCT IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
25CTGAAGTGTGAAGGCT IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
26CCATGTCAGACCGGAT IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
27CGTTCTGAGCTCAACT IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT
seurat标准过程:
1library(tidyverse)
2clono_seurat %>% NormalizeData() %>% FindVariableFeatures() %>%
3 ScaleData(assay="RNA") %>%
4 RunPCA(npcs = 50,assay="RNA")%>%
5 FindNeighbors(assay="RNA")%>%
6 FindClusters(assay="RNA")%>%
7 RunUMAP(1:30) -> clono_seurat
8
9head(clono_seurat@meta.data)
10
11 orig.ident nCount_RNA nFeature_RNA nCount_Protein nFeature_Protein clonotype_id barcode is_cell
12CTTTGCGCAAGGACTG SeuratProject 1266 642 696 17 clonotype1 CTTTGCGCAAGGACTG True
13CAGCTAACACCGATAT SeuratProject 935 484 545 17 clonotype1 CAGCTAACACCGATAT True
14TATTACCGTCGCTTCT SeuratProject 1352 594 422 17 clonotype1 TATTACCGTCGCTTCT True
15CTGAAGTGTGAAGGCT SeuratProject 1259 626 450 17 clonotype1 CTGAAGTGTGAAGGCT True
16CCATGTCAGACCGGAT SeuratProject 1088 555 741 17 clonotype1 CCATGTCAGACCGGAT True
17CGTTCTGAGCTCAACT SeuratProject 808 461 4896 17 clonotype1 CGTTCTGAGCTCAACT True
18 contig_id high_confidence length chain v_gene d_gene j_gene c_gene full_length productive
19CTTTGCGCAAGGACTG CTTTGCGCAAGGACTG-1_contig_1 True 479 IGK IGKV3-15 None IGKJ2 IGKC True True
20CAGCTAACACCGATAT CAGCTAACACCGATAT-1_contig_1 True 549 IGK IGKV3-15 None IGKJ2 IGKC True True
21TATTACCGTCGCTTCT TATTACCGTCGCTTCT-1_contig_1 True 565 IGK IGKV3-15 None IGKJ2 IGKC True True
22CTGAAGTGTGAAGGCT CTGAAGTGTGAAGGCT-1_contig_1 True 554 IGK IGKV3-15 None IGKJ2 IGKC True True
23CCATGTCAGACCGGAT CCATGTCAGACCGGAT-1_contig_1 True 556 IGK IGKV3-15 None IGKJ2 IGKC True True
24CGTTCTGAGCTCAACT CGTTCTGAGCTCAACT-1_contig_1 True 565 IGK IGKV3-15 None IGKJ2 IGKC True True
25 cdr3 cdr3_nt reads umis raw_consensus_id frequency proportion
26CTTTGCGCAAGGACTG CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 197 2 clonotype1_consensus_1 7 0.009067358
27CAGCTAACACCGATAT CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 274 2 clonotype1_consensus_1 7 0.009067358
28TATTACCGTCGCTTCT CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 299 2 clonotype1_consensus_1 7 0.009067358
29CTGAAGTGTGAAGGCT CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 250 3 clonotype1_consensus_1 7 0.009067358
30CCATGTCAGACCGGAT CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 324 2 clonotype1_consensus_1 7 0.009067358
31CGTTCTGAGCTCAACT CQQYDNWPPYTF TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 187 3 clonotype1_consensus_1 7 0.009067358
32 cdr3s_aa cdr3s_nt RNA_snn_res.0.8 seurat_clusters
33CTTTGCGCAAGGACTG IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 5 5
34CAGCTAACACCGATAT IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 5 5
35TATTACCGTCGCTTCT IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 5 5
36CTGAAGTGTGAAGGCT IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 5 5
37CCATGTCAGACCGGAT IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 5 5
38CGTTCTGAGCTCAACT IGK:CQQYDNWPPYTF IGK:TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 5 5
1p1 <- DimPlot(clono_seurat)
2p2 <-DimPlot(clono_seurat,group.by = "chain")
3p3 <-FeaturePlot(clono_seurat,features="frequency")
4
5library(patchwork)
6p1+p2+p3
1library(sankeywheel)
2sankeywheel(
3 from = clono_seurat@meta.data$chain, to = clono_seurat@meta.data$v_gene,
4 weight = clono_seurat@meta.data$nCount_Protein,#type = "sankey",
5 title = "sk",
6 subtitle = "chain & v_gene",
7 seriesName = "", width = "100%", height = "600px"
8)
计算不同链的差异基因:
1DT::datatable(FindMarkers(clono_seurat,ident.1 = 'IGK',group.by = "chain"))
2mk<-FindMarkers(clono_seurat,ident.1 = 'IGK',group.by = "chain")
1DotPlot(clono_seurat,features=rownames(mk),group.by = "chain") + RotatedAxis()
1RidgePlot(clono_seurat,features=rownames(mk),group.by = "chain")
References
[1]
Tutorial: Integrating VDJ sequencing data with Seurat: https://links.jianshu.com/go?to=https%3A%2F%2Fwww.biostars.org%2Fp%2F384640%2F[2]
Tutorial: Associating VDJ clonotyping data with scRNA-seq in Seurat: https://links.jianshu.com/go?to=https%3A%2F%2Fwww.biostars.org%2Fp%2F383217%2F
如果你对单细胞转录组研究感兴趣,但又不知道如何入门,也许你可以关注一下下面的课程
生信爆款入门-全球听(买一得五)(第5期)(可能是最后一期) 你的生物信息入门课
(必看!)数据挖掘第3期(两天变三周,实力加量) 医学生/医生首选技能提高课
生信技能树的2019年终总结 你的生物信息成长宝藏
看完记得顺手点个“在看”哦!
长按扫码可关注